Lecture Week 05: linear models
1 Overview
2 Setup
2.0.1 Libraries
You may need to install the arm package with the following:
install.packages("arm")
Remember to only run this once in your console, and to NOT put it into your script.
Then load the libraries:
library(arm)
library(tidyverse)
2.0.2 Data
For this lab, we will be using data from an experiment that Charles Darwin performed on the effects of self-pollination on Zea maize plants.
Darwin was interested in determining whether high genetic diversity also led to healthier plants, and it is assumed that self-pollinated plants will have lower genetic diversity.
Briefly, Darwin was interested in whether or not there was a difference in fitness of seeds which were produced from self- or cross-pollinated plants.
Here, the height of the plants was used as a surrogate for fitness.
We will be using the darwin_maize.csv data, which you can download from D2L. Be sure to put it into your data/ folder.
When we read in the data, we will assign it to an object called darwin_maize.
darwin_maize <- read.csv("data/darwin_maize.csv",
                         stringsAsFactors = TRUE)
3 Outline
What is a model?
What is a linear model?
Basic linear models (problems 1-3, 5)
Confidence intervals for model coefficients (problem 4)
Linear Model Assumptions
4 Lecture
4.0.1 What is a model?
“An informative representation of an object, person, or system.”
Many types of models
Conceptual
Graphical
Mathematical
We will be working with statistical models
- Mathematical representation of our hypotheses
By necessity, models will be simplifications of reality
“All models are wrong, but some are useful”
Inference requires models
Models link observations to processes
4.0.2 What is a linear model?
- What is the equation of a line?
Answer is probably something similar to:
\[\Huge y = a + bx\]
x <- seq(from = -10, to = 10, by = 0.1)
line_df <- data.frame(x = x,
                      y = -5 + 2 * x)
ggplot(line_df, aes(x = x, y = y)) +
  geom_path(color = "#446E9B", linewidth = 2) +
  theme_bw()
In reality, not every point is exactly on the line, right?
Add stochasticity:
\[\Huge y_i = a + bx_i + \epsilon_i\]
- \(\epsilon\) is “epsilon”, and often represents “error” or “noise”
set.seed(23)
obs_df <- line_df |>
  sample_n(50) |>
  rowwise() |>
  mutate(obs = x + rnorm(1, 0, 2))
ggplot(line_df, aes(x = x, y = y)) +
  geom_path(color = "black", linewidth = 2) +
  geom_point(data = obs_df,
             aes(x = obs, y = y),
             size = 2,
             color = "dodgerblue") +
  theme_bw()
4.0.3 Basic linear models (problems 1-3, 5)
First, let’s modify our variables in the linear model
our intercept was \(a\) but we are going to change it to \(\beta_0\)
- “Beta zero”
the \(b\) will be called \(\beta_1\)
“beta one”
This is much more flexible as we can have \(\beta_2\), \(\beta_3\) … \(\beta_n\)
Linear models have the following components:
\[\Large response = deterministic\; part+stochastic\; part\]
- Write the model showing how our expected value \(E[y_i]\) is a function of:
\[\underbrace{\LARGE E[y_i] = \beta_0 + \beta_1 \times x_1}_{Deterministic}\]
- With the addition of a stochastic error:
\[\underbrace{\LARGE y_i \sim normal(E[y_i], \sigma)}_{Stochastic}\]
- Add all together:
\[\LARGE y_i = \beta_0 + \beta_1 \times x_1 + \epsilon_i\]
note that in these examples, the \(i\) subscript corresponds to individual observations
The \(1\) subscript refers to the predictor
This could be different group identities within a category
Or we could have multiple \(x\)’s (we will come back to this later)
4.0.3.1 Global Intercept model
If there are no groups and no continuous predictors, we can fit a model which is calculating the global intercept
Here, \(x_1 = 0\), so \(\beta_1 \times x_1 = 0\) and the model simplifies.
In other words, this is calculating the average value of our response
\[\LARGE E[y] = \beta_0\]
Let’s look at the darwin_maize data.
names(darwin_maize)
[1] "pot"    "pair"   "type"   "height"
head(darwin_maize)
  pot pair  type height
1 I 1 Cross 23.500
2 I 1 Self 17.375
3 I 2 Cross 12.000
4 I 2 Self 20.375
5 I 3 Cross 21.000
6 I 3 Self 20.000
We are interested in the height column as our response.
Later we will look at the type column for our groups.
Let’s plot the response variable, height, in the darwin_maize data.
ggplot(darwin_maize,
       aes(y = height,
           x = 0)) +
  geom_point() +
  theme_bw()
We can also just plot the mean of this variable.
ggplot(darwin_maize,
       aes(x = 0,
           y = height)) +
  stat_summary(fun = mean,
               geom = "point",
               color = "black",
               shape = 15,
               size = 6) +
  theme_bw()
Now let’s fit our first linear model using the lm() function.
When fitting a model, you will use the syntax:
response_column ~ predictor_column
In this case there is no predictor column, so we just use the number 1.
dlm0 <- lm(formula = height ~ 1,
           data = darwin_maize)
dlm0 is a new object type: it is a list.
Each element in the list contains different information about the linear model.
mode(dlm0)
[1] "list"
names(dlm0)
 [1] "coefficients" "residuals"    "effects"      "rank"
[5] "fitted.values" "assign" "qr" "df.residual"
[9] "call" "terms" "model"
str(dlm0)
List of 11
$ coefficients : Named num 18.9
..- attr(*, "names")= chr "(Intercept)"
$ residuals : Named num [1:30] 4.62 -1.51 -6.88 1.49 2.12 ...
..- attr(*, "names")= chr [1:30] "1" "2" "3" "4" ...
$ effects : Named num [1:30] -103.428 -2.221 -7.596 0.779 1.404 ...
..- attr(*, "names")= chr [1:30] "(Intercept)" "" "" "" ...
$ rank : int 1
$ fitted.values: Named num [1:30] 18.9 18.9 18.9 18.9 18.9 ...
..- attr(*, "names")= chr [1:30] "1" "2" "3" "4" ...
$ assign : int 0
$ qr :List of 5
..$ qr : num [1:30, 1] -5.477 0.183 0.183 0.183 0.183 ...
.. ..- attr(*, "dimnames")=List of 2
.. .. ..$ : chr [1:30] "1" "2" "3" "4" ...
.. .. ..$ : chr "(Intercept)"
.. ..- attr(*, "assign")= int 0
..$ qraux: num 1.18
..$ pivot: int 1
..$ tol : num 1e-07
..$ rank : int 1
..- attr(*, "class")= chr "qr"
$ df.residual : int 29
$ call : language lm(formula = height ~ 1, data = darwin_maize)
$ terms :Classes 'terms', 'formula' language height ~ 1
.. ..- attr(*, "variables")= language list(height)
.. ..- attr(*, "factors")= int(0)
.. ..- attr(*, "term.labels")= chr(0)
.. ..- attr(*, "order")= int(0)
.. ..- attr(*, "intercept")= int 1
.. ..- attr(*, "response")= int 1
.. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. ..- attr(*, "predvars")= language list(height)
.. ..- attr(*, "dataClasses")= Named chr "numeric"
.. .. ..- attr(*, "names")= chr "height"
$ model :'data.frame': 30 obs. of 1 variable:
..$ height: num [1:30] 23.5 17.4 12 20.4 21 ...
..- attr(*, "terms")=Classes 'terms', 'formula' language height ~ 1
.. .. ..- attr(*, "variables")= language list(height)
.. .. ..- attr(*, "factors")= int(0)
.. .. ..- attr(*, "term.labels")= chr(0)
.. .. ..- attr(*, "order")= int(0)
.. .. ..- attr(*, "intercept")= int 1
.. .. ..- attr(*, "response")= int 1
.. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv>
.. .. ..- attr(*, "predvars")= language list(height)
.. .. ..- attr(*, "dataClasses")= Named chr "numeric"
.. .. .. ..- attr(*, "names")= chr "height"
- attr(*, "class")= chr "lm"
We can use functions to help us display the important parts of the model.
Today we will use display() from the arm package.
Later, we will use summary()
NOT to be confused with summarize()
display(dlm0)
lm(formula = height ~ 1, data = darwin_maize)
coef.est coef.se
(Intercept) 18.88 0.58
---
n = 30, k = 1
residual sd = 3.18, R-Squared = 0.00
This output is called a table of coefficients.
There is a row for each coefficient
- in this case, just the intercept
The columns show the estimate as well as the standard error of the estimate.
Coefficient tables in R always list the first coefficient as the intercept
It can sometimes be tricky to figure out what it’s actually referring to.
In this case, we can confirm it’s the global average by using the mean() function, either on its own or with dplyr:
mean(darwin_maize$height)
[1] 18.88333
darwin_maize %>%
  summarize(global_avg = mean(height))
  global_avg
1   18.88333
The number of observations is displayed in the table of coefficients as n, and the number of parameters estimated is shown with k.
It also shows us the \(R^2\) value, which tells us how much variation is explained by the predictor variables used in our model.
\(R^2\) values range from 0 (no explanation) to 1 (perfect explanation).
In this case we don’t have any predictor variables, so the intercept-only model explains 0% of the variation in the response variable.
The table also shows us the residual standard deviation, which in this case is the same as using sd() since we are ignoring group-levels.
sd(darwin_maize$height)
[1] 3.180953
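Both summary values can be reproduced by hand. Below is a minimal sketch using simulated data (the toy vector y is a stand-in assumption for the height column): \(R^2\) is one minus the residual sum of squares over the total sum of squares, and for an intercept-only model the residual sd equals sd() of the response.

```r
set.seed(42)
y <- rnorm(30, mean = 19, sd = 3)  # stand-in for darwin_maize$height
fit0 <- lm(y ~ 1)

# R-squared = 1 - SS_residual / SS_total; identically 0 when the only
# predictor is the intercept (the fitted value IS the mean)
r2 <- 1 - sum(resid(fit0)^2) / sum((y - mean(y))^2)

# the residual sd from an intercept-only fit equals sd(y)
c(r2 = r2, resid_sd = sigma(fit0), sd_y = sd(y))
```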
4.0.4 Returning back to the model
- This model:
\[\LARGE E[y] = \beta_0\]
Only shows us the expected value of \(y\)
- In other words, just the average
But our data has lots of observations
In order to account for that variability, we modify our model to be this:
\[\LARGE y_i = \beta_0 + \epsilon_i\]
Now, our individual observations ( \(y_i\) ) can be explained by our intercept \(\beta_0\) plus some individual variation, \(\epsilon_i\)
So, we can now write out our distribution for \(\epsilon\)
\[\LARGE \epsilon \sim N(0, \sigma)\]
So, if we wanted to simulate new data observations with rnorm(), what values would we put in?
i.e., mean = ?, sd = ?
rnorm(10, mean = 18.88, sd = 3.18) [1] 12.36042 17.12835 17.03791 20.55950 18.43558 23.79608 22.61248 20.91501
[9] 18.72571 17.49878
- Finally, we can plot our full data with our estimated mean to illustrate this
ggplot(darwin_maize,
       aes(y = height,
           x = 0)) +
  geom_point(color = "dodgerblue") +
  stat_summary(fun = mean,
               geom = "point",
               color = "black",
               shape = 15,
               size = 6) +
  theme_bw()
4.0.4.1 Linear Model with Categorical Predictor
Now let’s add a categorical predictor
This is a predictor which classifies our observations into 2 or more groups
For the darwin_maize data, this is indicated in the type column, which has groups of Self and Cross.
Let’s go back to our original model
\[\LARGE E[y_j] = \beta_0 + \beta_1 \times x_1\]
Here, \(E[y_j]\) is the expected value of each group, where \(y\) is our response and \(j\) changes with each group.
To calculate the expected value of each group, we need to introduce “dummy variables”
for our reference group, \(x_1 = 0\)
For our second group, \(x_1 = 1\)
If we had more groups, we would have more \(x_j\)’s and they would either be a 0 or 1.
- We will come back to this when we discuss ANOVA
Since \(x_1 = 0\) for our reference group, the expected value is:
\[\LARGE E[y_{reference}] = \beta_0 + \beta_1 \times 0 = \beta_0\]
In nearly ALL our models, the intercept ( \(\beta_0\) ) is going to represent the average value of our reference group
Figuring out what our reference group is can be tricky
- Since \(x_1 = 1\) for our other group, the expected value is:
\[\LARGE E[y_{other}] = \beta_0 + \beta_1 \times 1 = \beta_0 + \beta_1\]
- So to figure out the average value of our other group, we need to add \(\beta_0\) and \(\beta_1\)
- When there are multiple groups, you always start with \(\beta_0\) and add \(\beta_j\)
- You won’t need to know if the \(\beta_j\)’s for your model are 0 or 1, but it’s important to know what’s going on behind the scenes.
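You can actually see the 0/1 dummy coding R builds behind the scenes with model.matrix(). A small sketch on a hypothetical two-group factor (the toy data frame is an assumption; the same call works on the darwin_maize data):

```r
toy <- data.frame(type = factor(c("Cross", "Cross", "Self", "Self")))

# one column of 1s for the intercept, plus a 0/1 dummy column
# (typeSelf) that switches on for the non-reference group
model.matrix(~ type, data = toy)
```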
4.0.4.2 You should now be able to complete problem 1 in the homework assignment.
4.0.4.3 Darwin maize with category
Let’s look at our darwin_maize data again. We will do the following:
Calculate group means
Display the data separated by group and show group-level means
ggplot(darwin_maize,
       aes(x = type,
           y = height,
           color = type)) +
  geom_point(position = position_jitter(
               width = 0.1,
               height = 0),
             size = 2) +
  stat_summary(fun = mean,
               geom = "point",
               color = "black",
               shape = 8,
               size = 6) +
  theme_bw() +
  scale_color_viridis_d(option = "inferno",
                        begin = 0.3,
                        end = 0.7)

type_means <- darwin_maize |>
  group_by(type) |>
  summarise(m_height = mean(height),
            sd_height = sd(height))

type_means
# A tibble: 2 × 3
type m_height sd_height
<fct> <dbl> <dbl>
1 Cross 20.2 3.62
2 Self 17.6 2.05
- Now let’s fit a model using the categorical predictor
dlm1 <- lm(height ~ 1 + type,
           data = darwin_maize)

display(dlm1)
lm(formula = height ~ 1 + type, data = darwin_maize)
coef.est coef.se
(Intercept) 20.19 0.76
typeSelf -2.62 1.07
---
n = 30, k = 2
residual sd = 2.94, R-Squared = 0.18
Note that when you have predictor variables in the formula the 1 is no longer needed, so running lm(height ~ type, data = darwin_maize) will give you the same result.
The layout of this table is similar as before, but now we have two rows, one called “(Intercept)” and one called “typeSelf”.
Recall that there are only 2 categories in the type variable.
Because typeSelf is listed in the table, you would be correct in assuming that (Intercept) is the average height of the plants in the Cross group.
\[\LARGE E[y_{cross}] = \beta_0 = 20.19\]
You might also assume that the typeSelf coefficient estimate is the average of the other group of plants. However, this coefficient is showing the estimated difference in means.
So in order to get the mean for the Self group, we need to add the (here negative) estimated difference to the estimate for the Cross group.
\[\LARGE E[y_{self}] = \beta_0 + \beta_1 \times 1 = 20.19 -2.62 = 17.57\]
These are the same values (with rounding) that we calculated manually above
Calculating estimated means in R
Can manually type the values
Or use the coefficient estimates directly
# estimated means
20.19
[1] 20.19
20.19 - 2.62
[1] 17.57

# using the coefficients
coef(dlm1)
(Intercept)    typeSelf
  20.191667   -2.616667

# type = Cross (the reference group)
coef(dlm1)[1]
(Intercept)
   20.19167

# type = Self
coef(dlm1)[1] + coef(dlm1)[2]
(Intercept)
     17.575
The “(Intercept)” here is the name of the vector position.
You can remove it with this:
self_est <- coef(dlm1)[1] + coef(dlm1)[2]
names(self_est) <- NULL
self_est
[1] 17.575
The SE of the mean (SEM) for the intercept uses the same formula we saw in previous classes.
The SE of the difference (SED) is calculated with a different formula:
\[\large SED = \sqrt{\frac{s^2_{1}}{n_1} + \frac{s^2_{2}}{n_2}}\]
This is a “pooled” estimate of variation
I won’t ask you to do this manually, but just be aware that using var() on each group will not get you the correct values.
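As a sanity check (not something you need to do on the homework), the SED formula can be plugged in with the rounded group summaries from type_means above; treat the numbers below as assumptions copied from that output.

```r
s1 <- 3.62          # sd of Cross heights (from type_means above)
s2 <- 2.05          # sd of Self heights
n1 <- n2 <- 15      # 15 plants per group

# SED = sqrt(s1^2/n1 + s2^2/n2)
sed <- sqrt(s1^2 / n1 + s2^2 / n2)
sed  # ~1.07, matching the coef.se for typeSelf in display(dlm1)
```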
4.0.4.4 You should now be able to complete problem 2 in the homework assignment.
4.0.5 Confidence intervals for model coefficients (problem 4)
- Recall that a 95% CI is approximated by:
\[\large 95\% CI = \bar{y} \pm 2*SEM\]
Similar to the SED calculation above, there is uncertainty in our estimates (the SEM and SED themselves), so we cannot simply multiply these values together to get an exact 95% CI.
Luckily, there is an R function, confint(), which specifically calculates confidence intervals for coefficients from linear models.
This function does not work on vectors; it only works on fitted model objects.
confint(dlm1)
                2.5 %     97.5 %
(Intercept) 18.63651 21.7468231
typeSelf -4.81599 -0.4173433
The output of confint() is the lower bound (2.5%) and the upper bound (97.5%) of the confidence interval.
So the 95% confidence interval for the average height of the Cross plants is (18.64, 21.75), and the 95% CI for the average difference between the two groups is (-4.82, -0.42).
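Behind the scenes, confint() uses a t quantile rather than the rough multiplier of 2: estimate ± t(0.975, df) × SE. A sketch that reconstructs the typeSelf interval from the coefficient values above (the estimate and SE are copied from that output, and df = n - k = 28):

```r
est <- -2.6167   # typeSelf estimate (from coef(dlm1))
se  <- 1.0741    # its standard error (the SED)
dfr <- 28        # residual df: n - k = 30 - 2

t_crit <- qt(0.975, df = dfr)   # about 2.05, slightly more than 2
est + c(-1, 1) * t_crit * se    # close to the confint() row for typeSelf
```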
4.0.6 Interpretation
Recall that the hypothesis was that self-pollinating plants were less fit due to lower genetic diversity.
What does our data indicate?
If there was NO EFFECT of self-pollination, we would expect the heights of each group to be the same.
Or in other words, the difference ( \(\beta_1\) ) between the groups would be 0.
\(\beta_1 = 0\) means NO difference in group-level means
\(\beta_1 \ne 0\) means there IS A DIFFERENCE between the group-level means
The 95% CI for \(\beta_1\) is completely negative (does not include 0).
We are fairly confident that there is a difference in plant height.
Because the mean height of the Self-group of plants is lower, we can conclude the difference in height is consistent with the original hypothesis (self-pollinated plants are shorter and therefore less-fit).
The arm package has a nice function for displaying coefficient estimates: coefplot(). This function also requires the object to be a model fit (it will not work on a vector).
coefplot(dlm1, xlim = c(-5, 0))
I added the xlim = c(-5, 0) argument to ensure that the plot displays 0, which is our value of interest. When you’re doing this on the homework, you may need to play around with different limits on the x-axis.
So, at the 95% confidence level, we can say that there is a decrease in mean plant height in self-pollinated plants.
We can test this same hypothesis at a higher level of confidence, i.e., 99% CI
4.0.7 99% CI and coefplot()
We can change the level of certainty in both the confint() and coefplot() functions.
We can calculate a 99% CI directly by adding level = 0.99 to the confint() function.
Unfortunately, the coefplot() function only allows us to change the number of standard deviations shown. We can set sd = 3 to get approximately the same values.
confint(dlm1, level = 0.99)
                0.5 %     99.5 %
(Intercept) 18.093790 22.2895433
typeSelf -5.583512 0.3501789
coefplot(dlm1, sd = 3)
Notice that at the 99% confidence level (or at 3 times the standard deviation), the interval now crosses 0.
In other words, at this more strict level of confidence, we cannot say that there is a difference in height between the two plant groups.
4.0.8 You should now be able to complete problem 3 in the homework assignment.
4.0.9 Linear Model Assumptions
It’s always important to remember that all models have assumptions, and you should check that those assumptions are reasonably met.
The 4 assumptions of a linear model are:
- Linearity
- Normality of residuals
- Equal variance (homoscedasticity) of residuals
- For categorical data, need to ensure equal variance across groups
- Independence
Assumption 1 is usually assessed graphically and will be more important when we get to regression analysis (continuous x-variables).
- For categorical data it is less often tested directly.
Testing assumption 4 directly is tough, so usually we assess this assumption based on our understanding of how the data were generated and observed.
For assumptions 2 and 3, even with categorical data, we can assess with graphs.
4.0.9.1 Assumption 2: normal distribution of residuals
Use a Q-Q plot, which shows our standardized residuals compared with theoretical quantiles.
We can do this using
ggplot().
We are using aes(sample = height); the aes() function has an argument called sample.
We do not use an x or y argument.
ggplot(darwin_maize, aes(sample = height)) +
stat_qq() + # plot qq points
  stat_qq_line() # reference line
If the residuals were perfectly normal, all the points would be exactly on the line.
Here, the data in the middle fit pretty well, but the observations at the extreme ends are rather far off the line.
Luckily, linear models are fairly robust to this assumption.
In this case, it doesn’t look perfect, but it’s not terrible.
We will discuss how to deal with non-normal residuals in the future.
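Base R offers an alternative that works directly on a residual vector: qqnorm() plus qqline(). A sketch with simulated data (the toy fit is an assumption; on the real data you would pass resid(dlm1) instead):

```r
set.seed(7)
toy_fit <- lm(y ~ 1, data = data.frame(y = rnorm(30, 19, 3)))

# Q-Q plot of the model residuals themselves
qqnorm(resid(toy_fit))
qqline(resid(toy_fit))
```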
4.0.9.2 Assumption 3: equal variance of residuals
We will compare the fitted model values with the residuals.
Residuals are a measure of how far away the observation is from the prediction
Draw an estimated value on the board
Draw an empirical observation
Show the distance
For this we can again use ggplot(); however, this time we are plotting the model object, not the raw data directly.
Note that I once again added a “jitter” argument to the width (x-axis) to help visualize the individual points but did not modify the position on the height (y-axis)
ggplot(dlm1,
       aes(x = .fitted, y = .resid)) +
  geom_point(
    position = position_jitter(
      width = 0.1,
      height = 0)) +
  geom_hline(yintercept = 0)
Warning: `fortify(<lm>)` was deprecated in ggplot2 3.6.0.
ℹ Please use `broom::augment(<lm>)` instead.
ℹ The deprecated feature was likely used in the ggplot2 package.
Please report the issue at <https://github.com/tidyverse/ggplot2/issues>.
What we are looking for here is that the minimum and maximum y-values are the same across the x-values, and that they are approximately equally distributed above and below the reference line (y = 0).
Basically, no pattern moving left to right
Same heights (max and min values)
Here, we have a few points that are well below the rest (difference in range)
- negative values below -3
It appears to be unequal (most points above the 0 reference line).
Similar to the results above, this is not perfect, but also not terrible.
4.0.9.3 Categorical data: Check variance within groups
When we have categorical predictors, we also want to ensure that the variance between groups is approximately the same.
We can do this with a boxplot of the groups.
ggplot(darwin_maize,
       aes(x = height,
           y = type)) +
  geom_boxplot() +
  labs(title = "Example of approx. equal variances across groups")
Here, the widths of the boxplots are approximately equal
Means that each group has approximately the same observed variation.
The dots on the left side for each group are not ideal, but this doesn’t look too concerning.
What would be concerning is if the widths of the boxes were different sizes. For example:
set.seed(2112)
df <- data.frame(x = c(rnorm(10, 10, 1),
                       rnorm(10, 20, 10)),
                 group = rep(c("A", "B"), each = 10))
ggplot(df,
       aes(y = group,
           x = x)) +
  geom_boxplot() +
  labs(title = "Example of unequal variances across groups")
4.0.10 You should now be able to complete problem 4 in the homework assignment.
4.0.11 Re-level your data to change the “reference” level
We can modify the data to estimate the mean and SE for the other plant group (Self).
Right now, the type column is a factor.
class(darwin_maize$type)
[1] "factor"
levels(darwin_maize$type)
[1] "Cross" "Self"
The default behavior for R is to order factor levels alphabetically.
We can change this by setting the levels of type manually.
I will first make a new object called darwin_factor, which will be an exact copy of darwin_maize.
I will then change the levels of the type variable directly.
# copy the data object
darwin_factor <- darwin_maize
# change the class of the type column
# and set the levels
darwin_factor$type <- factor(
  darwin_factor$type,
  levels = c("Self", "Cross"))

# what is the class of the new type column?
class(darwin_factor$type)
[1] "factor"

# what is the order of the levels?
levels(darwin_factor$type)
[1] "Self" "Cross"
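A shorter alternative to re-specifying every level is relevel(), which simply moves one existing level to the front. A sketch on a stand-alone factor (the toy vector is an assumption; the same call works on darwin_factor$type):

```r
type <- factor(c("Cross", "Self", "Cross", "Self"))
levels(type)   # alphabetical: Cross first

# make Self the reference (first) level
type2 <- relevel(type, ref = "Self")
levels(type2)  # Self is now first
```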
Now, we can perform the same linear model analysis as above, with the Self group as our reference level.
display(lm(height ~ type, data = darwin_factor))
lm(formula = height ~ type, data = darwin_factor)
coef.est coef.se
(Intercept) 17.58 0.76
typeCross 2.62 1.07
---
n = 30, k = 2
residual sd = 2.94, R-Squared = 0.18
Now we can see that the estimate and SE for the Intercept coefficient are different.
Likewise, the second row of the table now says typeCross instead of typeSelf.
4.0.12 You should now be able to complete the homework assignment.
5 Summary
Models are ways of estimating group-level statistics from empirical data
Important to think about the mathematical formula
What group-level mean is represented by \(\beta_0\)?
How do you calculate the mean of the other group(s)?
- \(\beta_0 + \beta_j\)
You can re-level your data to set your “reference” level
All models have assumptions
Be sure to check your data to see if assumptions are reasonable